Notebook initialisation and example files

Plotly imports

In the notebook, import plotly's offline module and enable direct plotting in the current notebook.

In [28]:
import plotly.offline as py
py.init_notebook_mode(connected=True)

PycoQC import

Import the main pycoQC class.

In [29]:
from pycoQC.pycoQC import pycoQC

Example files

The pycoQC repository contains 6 example sequencing summary files generated with various versions of Albacore. Each of those files contains only 10,000 reads.

  • ./tests/data/Albacore-1.2.1_basecall-1D-DNA_small_sequencing_summary.txt.gz
  • ./tests/data/Albacore-1.2.3_basecall-1D-RNA_small_sequencing_summary.txt.gz
  • ./tests/data/Albacore-1.7.0_basecall-1D-DNA_small_sequencing_summary.txt.gz
  • ./tests/data/Albacore-2.1.10_basecall-1D-DNA_small_sequencing_summary.txt.gz
  • ./tests/data/Albacore-2.1.10_basecall-1D-RNA_small_sequencing_summary.txt.gz
  • ./tests/data/Albacore-2.3.1_basecall-1D-RNA_small_sequencing_summary.txt.gz

Using pycoQC

General information

pycoQC is a simple class that is initialized with a sequencing_summary file generated by ONT Albacore.

The methods of the instantiated object can then be called to generate tables and plots.

There are a few different ways to get help for all the public package functions:

  • In a separate window, with the Jupyter magic "?": ?pycoQC.channels_activity
  • In an output cell, with the standard help function: help(pycoQC.channels_activity)
  • Inline, with the cursor on the function of interest, by pressing shift + tab

All the plots are generated with the offline version of plotly for Python.

All the plotting methods return a plotly Figure object that users can customize further or export in various formats.

In addition, users can customize the figures online in a user-friendly environment by clicking on "Edit in Chart Studio" in the upper right corner of each figure.

Similarly, static pictures can be exported using the "Download plot as a png" button.

Initialisation

Upon initialization, pycoQC reads the sequencing summary file, runs a series of tests and pre-processes the data for the plotting methods.

Sequencing_summary file

PycoQC can read compressed sequencing_summary.txt files (‘gzip’, ‘bz2’, ‘zip’, ‘xz’) and can load a summary file directly from a URL.
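To illustrate what reading a compressed summary file involves, here is a minimal sketch that uses only the Python standard library. It is not pycoQC's own loader: the helper names open_summary and read_summary are hypothetical, the column names and values in the tiny example file are assumptions made for illustration, and 'zip' archives and URLs are left out for brevity.

```python
import bz2
import csv
import gzip
import lzma
import os
import tempfile

# Map file extensions to the stdlib opener able to decompress them.
# Any other extension falls back to the builtin open (plain text).
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_summary(path):
    """Open a (possibly compressed) tab-separated summary file in text mode."""
    extension = os.path.splitext(path)[1]
    opener = _OPENERS.get(extension, open)
    return opener(path, "rt", newline="")


def read_summary(path):
    """Return the rows of a summary file as a list of dicts keyed by column name."""
    with open_summary(path) as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Write a tiny gzip-compressed example file (made-up values), then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example_sequencing_summary.txt.gz")
    with gzip.open(example_path, "wt", newline="") as handle:
        handle.write("read_id\trun_id\tsequence_length_template\tmean_qscore_template\n")
        handle.write("read_1\trun_A\t1200\t9.5\n")
        handle.write("read_2\trun_A\t0\t3.2\n")
    rows = read_summary(example_path)

print(len(rows), rows[0]["read_id"])
```

All values come back as strings with this approach, so numeric columns would still need converting before any analysis.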

Depending on the run type and the version of Albacore used, some information might not be available. In particular, calibration reads were not flagged in earlier versions of Albacore. When the field is available, those reads are automatically discarded. Similarly, barcode information is only available in multiplexed runs.

Run type

The type of run (1D or 1D2) is automatically detected, but it can be explicitly enforced with run_type if needed.

Run ID reordering

A single sequencing_summary file often contains several run IDs. Unfortunately, the correct order cannot be determined from the information in the sequencing_summary.txt file alone. By default, pycoQC reorders the runs by decreasing throughput, which should normally reflect the sequencing order. However, if you know the order, you can specify it at initialisation with the runid_list option. This option can also be used to select specific run IDs.
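The ordering logic described here can be sketched in a few lines of standard-library Python. This is an illustration only and not pycoQC's code: the function names are hypothetical, and throughput is assumed to mean the number of reads per run ID (it could equally be measured in bases).

```python
from collections import Counter


def order_runids_by_throughput(run_ids):
    """Return the distinct run IDs sorted by decreasing number of reads."""
    counts = Counter(run_ids)
    # most_common() sorts by decreasing count; ties keep first-seen order.
    return [run_id for run_id, _ in counts.most_common()]


def resolve_run_order(run_ids, runid_list=None):
    """Use the caller's order if one is given, else order by decreasing read count."""
    if runid_list:
        present = set(run_ids)
        # Keep the user's order, silently dropping IDs absent from the data.
        return [run_id for run_id in runid_list if run_id in present]
    return order_runids_by_throughput(run_ids)


# One entry per read: run_A has 3 reads, run_B has 2, run_C has 1.
reads = ["run_B", "run_A", "run_A", "run_C", "run_A", "run_B"]

default_order = resolve_run_order(reads)
forced_order = resolve_run_order(reads, runid_list=["run_C", "run_A"])
print(default_order)  # ['run_A', 'run_B', 'run_C']
print(forced_order)   # ['run_C', 'run_A']
```

Passing an explicit list both imposes the order and restricts the analysis to the listed run IDs, which mirrors the two uses of runid_list.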

Minimal "pass" quality

By default, pycoQC assumes that the minimal mean quality for a "pass" read is 7 (the same as the Albacore default). To use a different value, specify it at initialisation with min_pass_qual.
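As a rough illustration of what this threshold does, the sketch below splits a list of mean quality scores into "pass" and "fail" reads. The function name split_pass_fail is hypothetical, and treating a score equal to the threshold as a pass is an assumption based on the word "minimal".

```python
def split_pass_fail(mean_qualities, min_pass_qual=7):
    """Split mean quality scores into (passed, failed) lists."""
    # Assumption: a read whose mean quality equals the threshold counts as a pass.
    passed = [q for q in mean_qualities if q >= min_pass_qual]
    failed = [q for q in mean_qualities if q < min_pass_qual]
    return passed, failed


qualities = [5.1, 7.0, 9.8, 10.0, 12.3]

passed_default, failed_default = split_pass_fail(qualities)
passed_strict, failed_strict = split_pass_fail(qualities, min_pass_qual=10)
print(len(passed_default), len(failed_default))  # 4 1
print(len(passed_strict), len(failed_strict))    # 2 3
```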

In [30]:
help(pycoQC.__init__)
Help on function __init__ in module pycoQC.pycoQC:

__init__(self, seq_summary_file, run_type=None, runid_list=[], min_pass_qual=7, iplot=True, verbose=False)
    Parse Albacore sequencing_summary.txt file and clean-up the data
    * seq_summary_file: STR
        Path to the sequencing_summary generated by Albacore 1.0.0 +
    * run_type: STR [Default None = autodetect]
        Force to us the Type of the run 1D or 1D2
    * runid_list: LIST of STR [Default []]
        Select only specific runids to be analysed. Can also be used to force pycoQC to order the runids for
        temporal plots, if the sequencing_summary file contain several sucessive runs. By default pycoQC analyses
        all the runids in the file and uses the runid order as defined in the file.
    * min_pass_qual INT [Default 7]
        Pass reads are defined throughout the package based on this threshold
    * iplot
        if False the ploting function do not plot the results but only return the a plotly.Figure object.
        This is mainly intended to non-interative ploting

In [36]:
p = pycoQC("https://www.ebi.ac.uk/~aleg/data/pycoQC_test/Albacore-2.3.1_basecall-1D-RNA_sequencing_summary.txt.gz", verbose=True, min_pass_qual=10)

Importing data

 100000 reads found in initial file

Verify and rearrange fields

 1D Run type

Filter out reads corresponding to the calibration strand

 5535 reads discarded

Filter out zero length reads

 162 reads discarded

Order run_ids by decreasing througput

 Processing reads with Run_ID 9835d20f1d205bdbd1fb4d464ae778de95beab24 / time offset: 0

Reindex and sort

 94303 Total valid reads found

Run summary

The summary method generates a simple summary table with a clickable button to switch from "all reads" to "pass reads" only.

In [31]:
help(pycoQC.summary)
Help on function summary in module pycoQC.pycoQC:

summary(self, width=1400, height=None)
    Plot an interactive summary table
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel

In [25]:
p = pycoQC("./data/Albacore-1.2.3_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.summary()

Read Length and Mean quality distribution

pycoQC has 3 methods to visualize the distribution of mean quality scores and of estimated read length:

  • reads_len_1D: A distribution histogram of estimated read length on a logarithmic scale
  • reads_qual_1D: A distribution histogram of mean quality scores
  • reads_len_qual_2D: A density contour plot of estimated read length vs mean quality scores on a semilog scale

Although we recommend sticking to the default values, all 3 methods allow users to customize the plots.

  • The number of bins used to divide the read quality and/or length space can be specified with nbins for the 1D plots and len_nbins / qual_nbins for the 2D plot
  • The intensity of line smoothing (using a Gaussian kernel filter) can be specified with smooth_sigma
  • Additional cosmetic customizations are available: color/colorscale, width and height
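To give an intuition for what nbins and smooth_sigma control, here is a pure-Python sketch of log-scale binning followed by Gaussian smoothing. It is not pycoQC's implementation: the function names are hypothetical, and the kernel truncation at 4 standard deviations and the zero padding at the edges are choices made for this illustration.

```python
import math


def log_histogram(lengths, nbins=200):
    """Count read lengths in nbins bins evenly spaced on a log10 scale."""
    logs = [math.log10(n) for n in lengths if n > 0]  # zero-length reads are skipped
    if not logs:
        raise ValueError("no positive read length to bin")
    low, high = min(logs), max(logs)
    width = (high - low) / nbins or 1.0  # avoid a zero width if all lengths are equal
    counts = [0] * nbins
    for value in logs:
        # The largest value sits on the right edge, so clamp it into the last bin.
        index = min(int((value - low) / width), nbins - 1)
        counts[index] += 1
    edges = [10 ** (low + i * width) for i in range(nbins + 1)]
    return counts, edges


def gaussian_smooth(counts, sigma=2):
    """Smooth a list of counts with a normalised Gaussian kernel (sigma in bins)."""
    if sigma <= 0:
        return [float(c) for c in counts]
    radius = int(4 * sigma)  # truncate the kernel at 4 standard deviations
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    smoothed = []
    for i in range(len(counts)):
        acc = 0.0
        for k, weight in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(counts):  # bins outside the range count as zero
                acc += weight * counts[j]
        smoothed.append(acc)
    return smoothed


lengths = [150, 900, 1000, 1100, 1200, 5000, 0]
counts, edges = log_histogram(lengths, nbins=20)
smoothed = gaussian_smooth(counts, sigma=2)
print(sum(counts), len(edges))  # 6 21
```

A larger smooth_sigma spreads each bin's count over more neighbours and gives a smoother but less detailed curve, while a larger nbins gives finer resolution at the cost of noisier counts per bin.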
In [32]:
help(pycoQC.reads_len_1D)
Help on function reads_len_1D in module pycoQC.pycoQC:

reads_len_1D(self, color='lightsteelblue', width=1400, height=500, nbins=200, smooth_sigma=2, sample=100000)
    Plot a distribution of read length (log scale)
    * color: Color of the area (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel
    * nbins: Number of bins to devide the x axis in
    * smooth_sigma: standard deviation for Gaussian kernel
    * sample: If given, a n number of reads will be randomly selected instead of the entire dataset

In [27]:
p = pycoQC("./data/Albacore-2.1.10_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.reads_len_1D()

In [33]:
help(pycoQC.reads_qual_1D)
Help on function reads_qual_1D in module pycoQC.pycoQC:

reads_qual_1D(self, color='salmon', width=1400, height=500, nbins=200, smooth_sigma=2, sample=100000)
    Plot a distribution of quality scores
    * color: Color of the area (hex, rgb, rgba, hsl, hsv or any CSV named colors https://www.w3.org/TR/css-color-3/#svg-color
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel
    * nbins: Number of bins to devide the x axis in
    * smooth_sigma: standard deviation for Gaussian kernel
    * sample: If given, a n number of reads will be randomly selected instead of the entire dataset

In [29]:
p = pycoQC("./data/Albacore-2.1.10_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.reads_qual_1D()

In [34]:
help(pycoQC.reads_len_qual_2D)
Help on function reads_len_qual_2D in module pycoQC.pycoQC:

reads_len_qual_2D(self, colorscale=[[0.0, 'rgba(255,255,255,0)'], [0.1, 'rgba(255,150,0,0)'], [0.25, 'rgb(255,100,0)'], [0.5, 'rgb(200,0,0)'], [0.75, 'rgb(120,0,0)'], [1.0, 'rgb(70,0,0)']], width=1400, height=600, len_nbins=None, qual_nbins=None, smooth_sigma=2, sample=100000)
    Plot a 2D distribution of quality scores vs length of the reads
    * colorscale: a valid plotly color scale https://plot.ly/python/colorscales/ (Not recommanded to change)
    * width: With of the ploting area in pixel
    * height: height of the ploting area in pixel
    * len_nbins: Number of bins to divide the read length values in (x axis)
    * qual_nbins: Number of bins to divide the read quality values in (y axis)
    * smooth_sigma: standard deviation for 2D Gaussian kernel
    * sample: If given, a n number of reads will be randomly selected instead of the entire dataset

In [16]:
p = pycoQC("./data/Albacore-2.1.10_basecall-1D-DNA_small_sequencing_summary.txt.gz")
fig = p.reads_len_qual_2D()

Output over time

In [13]:
p = pycoQC("./data/Albacore-1.2.3_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.output_over_time()

Quality over time

In [27]:
p = pycoQC("./data/Albacore-2.1.10_basecall-1D-DNA_small_sequencing_summary.txt.gz")
fig = p.qual_over_time()

Barcode distribution

In [15]:
p = pycoQC("./data/Albacore-1.2.3_basecall-1D-RNA_small_sequencing_summary.txt.gz")
fig = p.barcode_counts()

Channels activity over time

In [18]:
p = pycoQC("./data/Albacore-1.7.0_basecall-1D-DNA_small_sequencing_summary.txt.gz")
fig = p.channels_activity()